Parsing the Arabic Treebank: Analysis and Improvements
Abstract
Previous work has demonstrated that the performance of current parsers on Arabic is far below their performance on English or even Chinese, which in turn harms performance on NLP tasks that take parser output as input. This paper explores some of the issues behind this difference. We focus on the Collins parsing model [3] as implemented in the Bikel parser [1]. The corpus used for the experiments is the Arabic Treebank (ATB) [6]. We group these issues into three areas. First, a comparison of Arabic parsing performance with that of other languages must be a fair one; we therefore discuss some issues around evaluation and show that current Arabic parsing performance is not quite as bad as previously thought. Second, we present some modifications to the parser that provide modest increases in performance. Finally, we explore deeper differences between the Arabic Treebank and the Penn Treebank and offer some speculations as to why parsers have difficulty with Arabic.
Similar resources
Enhancing the Arabic Treebank: a Collaborative Effort toward New Annotation Guidelines
The Arabic Treebank team at the Linguistic Data Consortium has significantly revised and enhanced its annotation guidelines and procedure over the past year. Improvements were made to both the morphological and syntactic annotation guidelines, and annotators were trained in the new guidelines, focusing on areas of low inter-annotator agreement. The revised guidelines are now being applied in an...
Turkish Treebank as a Gold Standard for Morphological Disambiguation and Its Influence on Parsing
So far predicted scenarios for Turkish dependency parsing have used a morphological disambiguator that is trained on the data distributed with the tool (Sak et al., 2008). Although models trained on this data have high accuracy scores on the test and development data of the same set, the accuracy drastically drops when the model is used in the preprocessing of Turkish Treebank parsing experiment...
Utilizing State-of-the-art Parsers to Diagnose Problems in Treebank Annotation for a Less Resourced Language
The recent success of statistical parsing methods has made treebanks important resources for building good parsers. However, constructing high-quality annotated treebanks is a challenging task. We utilized two publicly available parsers, Berkeley and MST parsers, for feedback on improving the quality of part-of-speech tagging for the Vietnamese Treebank. Analysis of the treebank and parsi...
Syntactic Analysis of the Tunisian Arabic
In this paper, we study the problem of syntactic analysis of Dialectal Arabic (DA). Corpora are considered an important resource for the automatic processing of languages. Thus, we propose a method for creating a treebank for Tunisian Arabic (TA), the “Tunisian Treebank”, in order to adapt an Arabic parser to handle TA, which is considered a variant of the Arabic language.
Better Arabic Parsing: Baselines, Evaluations, and Analysis
In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design. First, we identify sources of syntactic ambiguity understudied in the existing parsing literature. Second, we show that although the Penn Arabic Treebank is similar to other treebanks in gross statistical terms, ...
Publication date: 2006